Abstract — Low-complexity near-optimal signal detection in 
large dimensional communication systems is a challenge. In this 
paper, we present a reactive tabu search (RTS) algorithm, a 
heuristic based combinatorial optimization technique, to achieve 
low-complexity near-maximum likelihood (ML) signal detection 
in linear vector channels with large dimensions. Two practically 
important large-dimension linear vector channels are considered: 
i) multiple-input multiple-output (MIMO) channels with large 
number (tens) of transmit and receive antennas, and ii) severely 
delay-spread MIMO inter-symbol interference (ISI) channels 
with large number (tens to hundreds) of multipath components. 
These channels are of interest because the former offers the 
benefit of increased spectral efficiency (several tens of bps/Hz) 
and the latter offers the benefit of high time-diversity orders. 
Our simulation results show that, while algorithms including 
variants of sphere decoding do not scale well for large dimensions, 
the proposed RTS algorithm scales well for signal detection 
in large dimensions, with performance that moves increasingly 
close to ML as the number of dimensions increases. 

Index Terms — Linear vector channels, large dimensions, low- 
complexity detection, near-ML performance, V-BLAST, non- 
orthogonal STBCs, MIMO-ISI channels, UWB, severe delay 
spread, tabu search. 

I. Introduction 
Large-dimension communication systems are likely to play 
an important role in modern wireless communications, where 
dimensions can be in space, time, frequency and their combi- 
nations. Large dimensions can bring several advantages with 
respect to the performance of communication systems. For 
example, the use of a large number of transmit/receive antennas 
increases the number of spatial dimensions, which results 
in increased capacity [1],[2]. A severely delay-spread inter-symbol 
interference (ISI) channel (i.e., a large number of echoes 
of the transmitted signal in the time dimension), as witnessed in 
ultrawideband (UWB) systems, can provide the opportunity 
for increased time-diversity [3]. Harnessing such benefits of 
large dimensions in practice, however, is challenging. In particular, 
optimum receiver complexity can become practically 
infeasible in large dimensions. Consequently, low-complexity 
receiver techniques/algorithms that scale well for large dimen- 
sions while achieving near-optimal performance are of interest. 
It has been found that many modern meta-heuristic algorithms 
give near-optimal performance at a much reduced complexity 
[4]. In this paper, we report one such heuristic based on tabu 
search [5],[6], and illustrate its near-optimal performance in 
two practically important large dimension systems, namely i) 
a 'large-MIMO system' with tens of transmit/receive antennas 
(with a motivation to achieve high spectral efficiencies), and 
ii) a severely delay-spread MIMO UWB system with tens 
to hundreds of multipath components (with a motivation to 
achieve high time-diversity orders). 

Tabu search (TS) is a heuristic originally designed to obtain 
approximate solutions to combinatorial optimization problems 
[5]-[8]. TS is increasingly being applied in communication 
problems [9]-[11]. For example, in [9], the design of constellation 
label maps to maximize asymptotic coding gain is formulated 
as a quadratic assignment problem, which is solved using a 
reactive TS (RTS) strategy [8]. The RTS approach is shown to 
be effective in terms of bit error performance and efficient 
in terms of computational complexity in CDMA multiuser 
detection [10]. In [11], a fixed TS based detection in V-BLAST 
is presented for a small number of antennas. A key objective in 
this paper is to propose a reactive tabu search based approach 
to seek approximately maximum-likelihood (ML) solutions 
in large dimension problems (but with significantly lower 
computational complexity than that of the true ML solution) 
in linear vector channels (LVC) in general, and to establish its 
performance and complexity in two interesting communication 
systems in particular. 

The first communication system we consider is a large- 
MIMO system that employs tens of transmit antennas to 
achieve high spectral efficiencies - e.g., a 16 x 16 V-BLAST 
system with 16-QAM and rate-3/4 turbo code can achieve 
a spectral efficiency of 48 bps/Hz. We show that the RTS 
algorithm performs increasingly close to ML as the number of 
transmit antennas increases (we refer to this 
behavior of the algorithm as the 'large-dimension behavior'). 
For example, in a 64 x 64 V-BLAST system with 4-QAM, RTS 
is shown to achieve $10^{-3}$ uncoded BER at an SNR of just 
0.4 dB away from single-input single-output (SISO) AWGN 
performance. We present a comparison of the performance 
and complexity of RTS with those of low-complexity variants 
of sphere decoders (SD), including a suboptimal fixed-complexity 
SD (FSD) reported in [12]. In a 32 x 32 V-BLAST 
system with 4-QAM, RTS is shown to perform better than 
FSD by about 1.5 dB at $10^{-2}$ uncoded BER. Interestingly, 
RTS achieves this better performance at about an order of 
magnitude lower complexity than FSD. We also show that RTS can 
achieve near-ML performance in decoding large non-orthogonal 
space-time block codes (STBCs) from cyclic division algebras (CDA), 
which can offer full transmit diversity in addition to achieving 
full rate as in V-BLAST [14],[15]. 

The second communication scenario considered is equal- 
ization in severely delay-spread MIMO-ISI UWB channels 
with a large number of multipath components (MPC). Communication 
systems using UWB techniques typically have very 
high transmission bandwidths to accommodate very high data 
rates [3]. Such UWB channels are characterized by severe 
ISI due to large delay spreads [17]-[20]. The number of 
MPCs in indoor and industrial environments has been observed 
to be of the order of several tens to hundreds; numbers of 
MPCs ranging from 12 to 120 are common in UWB channel 
models [17]-[20]. These MPCs, if carefully exploited, can 
provide the opportunity to achieve increased time-diversity 
benefits ifTTl . Algorithms based on likelihood ascent search 
(LAS)/bit flipping (221,(231, (28) and factor graphs E2 have 
been proposed for equalization in such systems. We show 
that the proposed RTS algorithm achieves increasingly close 
to optimal performance for increasing number of MPCs, and 
achieves better performance due to its inherent escape strategy 
from local minima. 

The rest of the paper is organized as follows. The proposed 
RTS algorithm for detection in linear vector channels is 
presented in Section II. BER performance and complexity of 
the RTS algorithm in comparison with those of other detectors 
including variants of sphere decoders are presented in Sections 
III to V. Conclusions are presented in Section VI. 

II. Proposed RTS Based Detection in LVCs 

We consider linear vector channels where a $d_t$-dimensional 
input vector$^1$ $\mathbf{x} \in \mathbb{A}^{d_t}$ ($\mathbb{A}$ denotes a finite set from the complex 
field) is linearly transformed by a $d_r \times d_t$ channel transfer 
matrix, $\mathbf{H} \in \mathbb{C}^{d_r \times d_t}$, and is corrupted by a $d_r$-dimensional 
noise vector, $\mathbf{n} \in \mathbb{C}^{d_r}$, so that the $d_r$-dimensional output 
vector, $\mathbf{y} \in \mathbb{C}^{d_r}$, is given by 

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}. \tag{1}$$



In communication systems, x and y can be the transmitted 
and received signal vectors, respectively, and the goal is to 
obtain an estimate of the transmitted vector x, given y and the 
knowledge of H. When the noise is Gaussian, the maximum- 
likelihood (ML) detection rule is given by 



$$\mathbf{x}_{ML} = \arg\min_{\mathbf{x} \in \mathbb{A}^{d_t}} \phi(\mathbf{x}), \tag{2}$$

where $\phi(\mathbf{x}) = \mathbf{x}^H \mathbf{H}^H \mathbf{H} \mathbf{x} - 2\Re(\mathbf{y}^H \mathbf{H} \mathbf{x})$. The computational 
complexity in (2) is exponential in $d_t$, which is prohibitive 
for large $d_t$. Our interest is to achieve near-ML performance 
for large $d_t$ at low complexities. In the following subsection, 
we present a RTS based detection algorithm which is a low-complexity 
iterative local search algorithm suited well for 
large $d_t$. 
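
To make the exponential cost of the ML rule concrete, the following is a minimal sketch of exhaustive ML detection in plain Python. The function names `phi` and `ml_detect`, the toy 2 x 2 channel and the BPSK alphabet are illustrative assumptions, not part of the paper; the search visits all $M^{d_t}$ candidates, which is exactly what becomes infeasible for large $d_t$.

```python
from itertools import product

def phi(x, H, y):
    """ML cost phi(x) = x^H H^H H x - 2 Re(y^H H x)."""
    Hx = [sum(h * xi for h, xi in zip(row, x)) for row in H]
    quad = sum(abs(v) ** 2 for v in Hx)                      # x^H H^H H x = ||Hx||^2
    cross = sum(yi.conjugate() * v for yi, v in zip(y, Hx))  # y^H H x
    return quad - 2 * cross.real

def ml_detect(H, y, alphabet):
    """Exhaustive search over all M^{d_t} candidate vectors (illustrative only)."""
    d_t = len(H[0])
    return min(product(alphabet, repeat=d_t), key=lambda x: phi(x, H, y))

# Toy example (assumed values): 2 x 2 real channel, BPSK, no noise.
H = [[1, 0.5], [0.2, 1]]
x_true = (1, -1)
y = [sum(h * s for h, s in zip(row, x_true)) for row in H]
x_hat = ml_detect(H, y, alphabet=(-1, 1))
```

With 4-QAM and $d_t = 64$ the same loop would have to visit $4^{64}$ vectors, which motivates the local search described next.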



1 Notation: Vectors and matrices are denoted by boldface lowercase letters 
and boldface uppercase letters, respectively. $(\cdot)^*$, $[\cdot]^T$, and $[\cdot]^H$ denote 
conjugation, transpose and Hermitian operations, respectively. $|\cdot|$ denotes the 
absolute value operator. $\mathbf{A}(i,j)$ denotes the element in the ith row and jth 
column of matrix $\mathbf{A}$. $a_i$ denotes the ith element of the vector $\mathbf{a}$. $\Re(\cdot)$ and 
$\Im(\cdot)$ denote the real and imaginary parts of a complex argument, and $j = \sqrt{-1}$. 
$\mathbf{I}_n$ denotes the $n \times n$ identity matrix. 

A. RTS Algorithm 

The RTS algorithm starts with an initial solution vector, 
defines a neighborhood around it (i.e., defines a set of neighboring 
vectors based on a neighborhood criterion), and moves 
to the best vector among the neighboring vectors (even if 
the best neighboring vector is worse, in terms of likelihood, 
than the current solution vector; this allows the algorithm 
to escape from local minima). This process is continued for 
a certain number of iterations, after which the algorithm is 
terminated and the best among the solution vectors in all the 
iterations is declared as the final solution vector. In defining 
the neighborhood of the solution vector in a given iteration, 
the algorithm attempts to avoid cycling by making the moves 
to solution vectors of the past few iterations as 'tabu' (i.e., 
prohibits these moves), which ensures efficient search of 
the solution space. The number of these past iterations is 
parametrized as the 'tabu period.' The search is referred to as 
fixed tabu search if the tabu period is kept constant. If the tabu 
period is dynamically changed (e.g., increase the tabu period 
if more repetitions of the solution vectors are observed in the 
search path), then the search is called reactive tabu search. 
We consider reactive tabu search in this paper because of its 
robustness (choice of a good fixed tabu period can be tedious). 

Neighborhood Definition: Let $M$ denote the cardinality of 
$\mathbb{A} = \{a_1, a_2, \cdots, a_M\}$. Define a set $\mathcal{N}(a_q)$, $q \in \{1, \cdots, M\}$, 
as a fixed subset of $\mathbb{A} \setminus a_q$, which we refer to as the symbol-neighborhood 
of $a_q$. We choose the cardinality of this set 
to be the same for all $a_q$, $q = 1, \cdots, M$; i.e., we take 
$|\mathcal{N}(a_q)| = N$, $\forall q$. Note that the maximum and minimum 
values of $N$ are $M - 1$ and 1, respectively. We choose the 
symbol neighborhood based on Euclidean distance, i.e., for 
a given symbol, those $N$ symbols which are the nearest will 
form its neighborhood; the nearest symbol will be the first 
neighbor, the next nearest symbol will be the second neighbor, 
and so on. For example, $\mathbb{A} = \{-3, -1, 1, 3\}$ for 4-PAM, and 
choosing $N$ to be 2, $\mathcal{N}(-3) = \{-1, 1\}$, $\mathcal{N}(-1) = \{-3, 1\}$, 
$\mathcal{N}(1) = \{-1, 3\}$, $\mathcal{N}(3) = \{1, -1\}$ are possible symbol-neighborhoods. 
Let $w_v(a_q)$, $v = 1, \cdots, N$ denote the vth 
element in $\mathcal{N}(a_q)$; i.e., we say $w_v(a_q)$ is the vth symbol-neighbor 
of $a_q$. 
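
The symbol-neighborhood rule can be sketched in a few lines. The helper name `symbol_neighborhoods` is hypothetical, and the tie-breaking rule (equidistant symbols keep their order in the alphabet) is an assumption chosen because it reproduces the 4-PAM example in the text; the paper does not state how ties are resolved.

```python
def symbol_neighborhoods(alphabet, N):
    """Map each symbol a_q to its N nearest symbols, nearest first.

    Python's sort is stable, so equidistant symbols keep alphabet order
    (an assumed tie-breaking rule).
    """
    nbrs = {}
    for a in alphabet:
        others = [s for s in alphabet if s != a]
        nbrs[a] = sorted(others, key=lambda s: abs(s - a))[:N]
    return nbrs

pam4 = symbol_neighborhoods((-3, -1, 1, 3), N=2)
```

Because only `abs(s - a)` is used, the same helper also accepts complex QAM symbols.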



Let $\mathbf{x}^{(m)} = [x_1^{(m)} \; x_2^{(m)} \cdots x_{d_t}^{(m)}]$ denote the data vector 
belonging to the solution space in the mth iteration, where 
$x_i^{(m)} \in \mathbb{A}$, $i = 1, \cdots, d_t$. We refer to the vector 

$$\mathbf{z}^{(m)}(u,v) = [z_1^{(m)}(u,v) \; z_2^{(m)}(u,v) \cdots z_{d_t}^{(m)}(u,v)] \tag{3}$$

as the $(u, v)$th vector-neighbor (or simply the $(u, v)$th 
neighbor) of $\mathbf{x}^{(m)}$, $u = 1, \cdots, d_t$, $v = 1, \cdots, N$, if i) $\mathbf{x}^{(m)}$ 
differs from $\mathbf{z}^{(m)}(u, v)$ in the uth coordinate only, and ii) the 
uth element of $\mathbf{z}^{(m)}(u, v)$ is the vth symbol-neighbor of $x_u^{(m)}$. 
That is, 

$$z_i^{(m)}(u,v) = \begin{cases} x_i^{(m)} & \text{for } i \neq u \\ w_v(x_u^{(m)}) & \text{for } i = u. \end{cases} \tag{4}$$



So we will have $d_t N$ vectors which differ from a given vector 
in the solution space in only one coordinate. These $d_t N$ 
vectors form the neighborhood of the given vector. We note 
that neighborhood definition based on bit-flipping [21],[22] 
is a special case of the above neighborhood definition for 
$M = 2$, $N = 1$. An operation on $\mathbf{x}^{(m)}$ which gives $\mathbf{x}^{(m+1)}$ 
belonging to the vector-neighborhood of $\mathbf{x}^{(m)}$ is called a 
move. The algorithm is said to execute a move $(u, v)$ if 
$\mathbf{x}^{(m+1)} = \mathbf{z}^{(m)}(u, v)$. We note that the number of candidates 
to be considered for a move in any one iteration is $d_t N$. Also, 
the overall number of 'distinct' moves possible is $d_t M N$, 
which is the cardinality of the union of all moves from all 
$M^{d_t}$ possible solution vectors. The tabu value of a move, 
which is a non-negative integer, means that the move cannot 
be considered for that many subsequent iterations, 
unless certain conditions are satisfied. 
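
As an illustration of the vector-neighborhood, the sketch below enumerates all $d_t N$ neighbors of a vector using the 4-PAM symbol-neighborhoods quoted above. The generator name `vector_neighbors` and the example vector are assumptions made for this sketch.

```python
def vector_neighbors(x, nbrs):
    """Yield ((u, v), z) for every (u, v)th vector-neighbor z of x (u, v 1-based)."""
    for u, x_u in enumerate(x, start=1):
        for v, w in enumerate(nbrs[x_u], start=1):
            z = list(x)
            z[u - 1] = w          # only the uth coordinate changes
            yield (u, v), tuple(z)

# Symbol-neighborhoods of the 4-PAM example with N = 2.
NBRS_4PAM = {-3: [-1, 1], -1: [-3, 1], 1: [-1, 3], 3: [1, -1]}
x = (3, -1, 1)
neighbors = dict(vector_neighbors(x, NBRS_4PAM))   # d_t * N = 6 entries
```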

Tabu Matrix: A tabu matrix $\mathbf{T}$ of size $d_t M \times N$ is the 
matrix whose entries denote the tabu values of moves. For each 
coordinate of the solution vector (there are $d_t$ coordinates), 
there are $M$ rows in $\mathbf{T}$, where each row corresponds to 
one symbol in the modulation alphabet $\mathbb{A}$; the indices of the 
rows for the uth coordinate are from $(u-1)M + 1$ to $uM$, 
$u \in \{1, \cdots, d_t\}$. The $N$ columns of $\mathbf{T}$ correspond to the 
$N$ symbol-neighbors of the symbol corresponding to each 
row. In other words, the $(r, s)$th entry of the tabu matrix, 
$r = 1, \cdots, d_t M$, $s = 1, \cdots, N$, corresponds to the move 
$(u, v)$ from $\mathbf{x}^{(m)}$ when $u = \lfloor \frac{r-1}{M} \rfloor + 1$, $v = s$ and $x_u^{(m)} = a_q$, 
where $q = \mathrm{mod}(r-1, M) + 1$. The entries of the tabu matrix, 
which are non-negative integers, are updated in each iteration, 
and they are used to decide the direction in which the search 
proceeds (as described in the algorithm description below). 
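
The row indexing of the tabu matrix is easy to get wrong by one, so a short sketch of the mapping between a row index $r$ and the pair (coordinate $u$, symbol index $q$) may help. Both helper names are hypothetical; all indices are 1-based as in the text.

```python
def move_to_row(u, q, M):
    """Row of T for coordinate u holding symbol a_q: r = (u - 1) M + q."""
    return (u - 1) * M + q

def row_to_move(r, M):
    """Inverse map: u = floor((r - 1) / M) + 1, q = mod(r - 1, M) + 1."""
    return (r - 1) // M + 1, (r - 1) % M + 1

# Example with d_t = 3 coordinates and M = 4 symbols: rows run from 1 to 12.
rows = [move_to_row(u, q, 4) for u in range(1, 4) for q in range(1, 5)]
```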

Algorithm: Let $\mathbf{g}^{(m)}$ be the vector which has the least 
ML cost found till the mth iteration of the algorithm. Let 
$l_{rep}$ be the average length (in number of iterations) between 
two successive occurrences of a solution vector (repetitions). 
Tabu period, $P$, a dynamic non-negative integer parameter, is 
defined as follows: if a move is marked as tabu in an iteration, 
it will remain as tabu for $P$ subsequent iterations unless the 
move results in a better solution. A binary flag, $lflag \in \{0, 1\}$, 
is used to indicate whether the algorithm has reached a local 
minimum in a given iteration or not; this flag is used in the 
evaluation of the stopping criterion of the algorithm. The 
algorithm starts with an initial solution vector $\mathbf{x}^{(0)}$, which, 
for example, could be the MMSE or matched filter output vector. 
Set $\mathbf{g}^{(0)} = \mathbf{x}^{(0)}$, $l_{rep} = 0$, and $P = P_0$. All the entries of 
the tabu matrix are set to zero. Define $\mathbf{y}_{MF} = \mathbf{H}^H \mathbf{y}$, and 
$\mathbf{R} = \mathbf{H}^H \mathbf{H}$. Compute $\mathbf{y}_{MF}$ and $\mathbf{R}$. The following steps 1) to 
3) are performed in each iteration. Consider the mth iteration in 
the algorithm, $m \geq 0$. 

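Step 1): Initialize $lflag = 0$. Define $\mathbf{f}^{(m)} = \mathbf{R}\mathbf{x}^{(m)} - \mathbf{y}_{MF}$. 
Let $\mathbf{e} = \mathbf{z}^{(m)}(u, v) - \mathbf{x}^{(m)}$. The ML costs of the $d_t N$ neighbors 
of $\mathbf{x}^{(m)}$, namely, $\mathbf{z}^{(m)}(u, v)$, $u = 1, \cdots, d_t$, $v = 1, \cdots, N$, are 
computed as 

$$\begin{aligned}
\phi(\mathbf{z}^{(m)}(u,v)) &= (\mathbf{x}^{(m)} + \mathbf{e})^H \mathbf{R} (\mathbf{x}^{(m)} + \mathbf{e}) - 2\Re\big((\mathbf{x}^{(m)} + \mathbf{e})^H \mathbf{y}_{MF}\big) \\
&= \phi(\mathbf{x}^{(m)}) + 2\Re(\mathbf{e}^H \mathbf{R} \mathbf{x}^{(m)}) + \mathbf{e}^H \mathbf{R} \mathbf{e} - 2\Re(\mathbf{e}^H \mathbf{y}_{MF}) \\
&= \phi(\mathbf{x}^{(m)}) + 2\Re\big(\mathbf{e}^H (\mathbf{R}\mathbf{x}^{(m)} - \mathbf{y}_{MF})\big) + \mathbf{e}^H \mathbf{R} \mathbf{e} \\
&= \phi(\mathbf{x}^{(m)}) + 2\Re(\mathbf{e}^H \mathbf{f}^{(m)}) + \mathbf{e}^H \mathbf{R} \mathbf{e} \\
&= \phi(\mathbf{x}^{(m)}) + \underbrace{2\Re(e_u^* f_u^{(m)}) + |e_u|^2 \mathbf{R}(u,u)}_{= C(u,v)},
\end{aligned} \tag{5}$$

where the last step follows since only one coordinate of $\mathbf{e}$ is 
non-zero, and $\mathbf{R}(u, u)$ is the $(u, u)$th element of $\mathbf{R}$. $\phi(\mathbf{x}^{(m)})$ 
on the RHS in (5) can be dropped since it will not affect the 
cost minimization. Let 

$$(u_1, v_1) = \arg\min_{u,v} C(u, v). \tag{6}$$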
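
The value of (5) is that each neighbor's cost needs only a scalar update rather than a full quadratic form. The sketch below checks this numerically on a small complex example; the helper names (`gram`, `matched_filter`, `phi`, `move_cost`) and the numeric values are assumptions for illustration, and the coordinate index `u` is 0-based in the code.

```python
def gram(H):
    """R = H^H H for a matrix given as a list of rows."""
    n = len(H[0])
    return [[sum(row[i].conjugate() * row[j] for row in H) for j in range(n)]
            for i in range(n)]

def matched_filter(H, y):
    """y_MF = H^H y."""
    return [sum(row[i].conjugate() * yk for row, yk in zip(H, y))
            for i in range(len(H[0]))]

def phi(x, R, y_mf):
    """phi(x) = x^H R x - 2 Re(x^H y_MF)."""
    Rx = [sum(r * xi for r, xi in zip(row, x)) for row in R]
    quad = sum(xi.conjugate() * v for xi, v in zip(x, Rx)).real
    return quad - 2 * sum(xi.conjugate() * v for xi, v in zip(x, y_mf)).real

def move_cost(u, new_symbol, x, R, f):
    """C(u, v) = 2 Re(e_u^* f_u) + |e_u|^2 R(u, u); u is 0-based here."""
    e_u = new_symbol - x[u]
    return 2 * (e_u.conjugate() * f[u]).real + abs(e_u) ** 2 * R[u][u].real

H = [[1 + 1j, 0.5 - 0.2j], [0.3j, 2 - 1j]]
y = [0.7 - 0.1j, -1.2 + 0.4j]
x = [1 + 1j, -1 - 1j]                      # current 4-QAM solution vector
R, y_mf = gram(H), matched_filter(H, y)
f = [sum(r * xi for r, xi in zip(row, x)) - b for row, b in zip(R, y_mf)]
z = [x[0], 1 - 1j]                         # neighbor differing in one coordinate
full = phi(z, R, y_mf) - phi(x, R, y_mf)   # direct cost difference
fast = move_cost(1, z[1], x, R, f)         # scalar update from (5)
```

Here `full` and `fast` agree up to floating-point rounding, so ranking the neighbors by `move_cost` is equivalent to ranking them by ML cost.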



The move $(u_1, v_1)$ is accepted if any one of the following two 
conditions is satisfied: 

$$\phi(\mathbf{z}^{(m)}(u_1, v_1)) < \phi(\mathbf{g}^{(m)}) \tag{7}$$
$$\mathbf{T}((u_1 - 1)M + q, v_1) = 0, \tag{8}$$

where $q$ is such that $a_q = x_{u_1}^{(m)}$, $a_q \in \mathbb{A}$. If move $(u_1, v_1)$ is 
not accepted (i.e., neither of the conditions in (7) and (8) is 
satisfied), find $(u_2, v_2)$ such that 

$$(u_2, v_2) = \arg\min_{u,v \,:\, (u,v) \neq (u_1,v_1)} C(u, v), \tag{9}$$

and check for acceptance of the $(u_2, v_2)$ move. If this also cannot 
be accepted, repeat the procedure for $(u_3, v_3)$, and so on. 
If all the $d_t N$ moves are tabu, then all the tabu matrix entries 
are decremented by the minimum value in the tabu matrix; 
this goes on till one of the moves becomes acceptable. Let 
$(u', v')$ be the index of the neighbor with the minimum cost 
for which the move is permitted. Make 

$$\mathbf{x}^{(m+1)} = \mathbf{z}^{(m)}(u', v'). \tag{10}$$

The variables $q'$, $q''$, $v''$ are implicitly defined by $a_{q'} = x_{u'}^{(m)} = 
w_{v''}(x_{u'}^{(m+1)})$, and $a_{q''} = x_{u'}^{(m+1)}$, where $a_{q'}, a_{q''} \in \mathbb{A}$. It is 
noted that in this Step 1 of the algorithm, essentially the best 
permissible vector-neighbor is chosen as the solution vector 
for the next iteration. 
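
The acceptance logic of Step 1 can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the function name `select_move` and the data layout (a list of `(C, r, s)` candidates and a dict for the tabu matrix) are assumptions. The text says to decrement by "the minimum value in the tabu matrix"; the sketch uses the minimum over the current candidate moves and clips at zero, an assumed interpretation that guarantees progress.

```python
def select_move(candidates, tabu, phi_x, phi_best):
    """Return the (r, s) tabu-matrix index of the best permissible move.

    candidates: list of (C, r, s), with C the cost increment of the move
    tabu:       dict mapping (r, s) to a non-negative tabu value
    phi_x:      ML cost of the current vector; phi_best: best cost so far
    """
    ordered = sorted(candidates)           # lowest cost increment first
    while True:
        for C, r, s in ordered:
            improves_best = phi_x + C < phi_best   # condition (7)
            not_tabu = tabu[(r, s)] == 0           # condition (8)
            if improves_best or not_tabu:
                return r, s
        # Every candidate is tabu: relax the tabu values (assumed reading).
        dec = min(tabu[(r, s)] for _, r, s in ordered)
        for key in tabu:
            tabu[key] = max(tabu[key] - dec, 0)

cands = [(-1.0, 1, 1), (0.5, 2, 1)]
pick_non_tabu = select_move(cands, {(1, 1): 3, (2, 1): 0}, 0.0, -5.0)
pick_aspiration = select_move(cands, {(1, 1): 3, (2, 1): 0}, 0.0, -0.5)
pick_relaxed = select_move(cands, {(1, 1): 3, (2, 1): 2}, 0.0, -5.0)
```

In the second call the tabu move is still taken because it would improve on the best cost found so far, which is the role of condition (7).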

Step 2): The new solution vector obtained from Step 1 is 
checked for repetition. For the linear vector channel model in 
(1), repetition can be checked by comparing the ML costs of 
the solutions in the previous iterations. If there is a repetition, 
the length of the repetition from the previous occurrence is 
found, the average length, $l_{rep}$, is updated, and the tabu period 
$P$ is modified as $P = P + 1$. If the number of iterations elapsed 
since the last change of the value of $P$ exceeds $\beta l_{rep}$, for a 
fixed $\beta > 0$, make $P = \max(1, P - 1)$. After a move $(u', v')$ 
is accepted, if $\phi(\mathbf{x}^{(m+1)}) < \phi(\mathbf{g}^{(m)})$, make 

$$\mathbf{T}((u'-1)M + q', v') = \mathbf{T}((u'-1)M + q'', v'') = 0, \tag{11}$$
$$\mathbf{g}^{(m+1)} = \mathbf{x}^{(m+1)}, \tag{12}$$

else 

$$\mathbf{T}((u'-1)M + q', v') = \mathbf{T}((u'-1)M + q'', v'') = P + 1, \tag{13}$$
$$lflag = 1, \quad \mathbf{g}^{(m+1)} = \mathbf{g}^{(m)}. \tag{14}$$

It is noted that this Step 2 of the algorithm implements the 
'reactive' part in the search, by dynamically changing $P$. 
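
A minimal sketch of the reactive update of $P$ in Step 2 is given below. Several details are assumptions rather than statements from the paper: the class name `ReactivePeriod`, detecting repetitions by storing visited vectors (the text compares ML costs instead), keeping $l_{rep}$ as a running mean, and applying the decrease rule only in iterations without a repetition.

```python
class ReactivePeriod:
    """Tracks the tabu period P and the average repetition length l_rep."""

    def __init__(self, P0, beta):
        self.P = P0
        self.beta = beta
        self.l_rep = 0.0
        self.n_rep = 0
        self.last_seen = {}      # solution vector -> iteration of last visit
        self.last_change = 0     # iteration at which P last changed

    def update(self, x, m):
        key = tuple(x)
        if key in self.last_seen:
            length = m - self.last_seen[key]
            self.n_rep += 1
            self.l_rep += (length - self.l_rep) / self.n_rep   # running mean
            self.P += 1                                        # react to cycling
            self.last_change = m
        elif m - self.last_change > self.beta * self.l_rep:
            self.P = max(1, self.P - 1)                        # relax slowly
            self.last_change = m
        self.last_seen[key] = m
        return self.P

rp = ReactivePeriod(P0=2, beta=0.1)
history = [rp.update(x, m) for m, x in enumerate([(1,), (2,), (1,), (3,)], start=1)]
```

In this toy run the third vector repeats the first, so $P$ is increased at that iteration and then decays again once no further repetition is seen.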

Step 3): Update the entries of the tabu matrix as 

$$\mathbf{T}(r, s) = \max\{\mathbf{T}(r, s) - 1, 0\}, \tag{15}$$

for $r = 1, \cdots, d_t M$, $s = 1, \cdots, N$, and update $\mathbf{f}^{(m)}$ as 

$$\mathbf{f}^{(m+1)} = \mathbf{f}^{(m)} + \left(x_{u'}^{(m+1)} - x_{u'}^{(m)}\right) \mathbf{R}_{u'}, \tag{16}$$

where $\mathbf{R}_{u'}$ is the u'th column of $\mathbf{R}$. The algorithm terminates 
in Step 3 if the following stopping criterion is satisfied, else 
it goes back to Step 1. 
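
The update in (16) follows from $\mathbf{f} = \mathbf{R}\mathbf{x} - \mathbf{y}_{MF}$ when a single coordinate of $\mathbf{x}$ changes, and costs $O(d_t)$ instead of $O(d_t^2)$. The sketch below compares it with a full recomputation; the helper names and the numeric values are illustrative assumptions, and `u` is 0-based in the code.

```python
def residual(x, R, y_mf):
    """f = R x - y_MF, computed from scratch."""
    return [sum(r * xi for r, xi in zip(row, x)) - b for row, b in zip(R, y_mf)]

def update_f(f, R, u, old_symbol, new_symbol):
    """f_new = f + (x_u_new - x_u_old) * (column u of R); u is 0-based."""
    delta = new_symbol - old_symbol
    return [fi + delta * row[u] for fi, row in zip(f, R)]

R = [[2.0, 0.5 - 0.5j], [0.5 + 0.5j, 3.0]]   # a Hermitian example matrix
y_mf = [1 - 1j, 0.5j]
x_old = [1 + 1j, -1 + 1j]
x_new = [1 + 1j, 1 + 1j]                      # coordinate u = 1 changed
f_new = update_f(residual(x_old, R, y_mf), R, 1, x_old[1], x_new[1])
f_ref = residual(x_new, R, y_mf)
```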

Stopping criterion: The algorithm can be stopped based 
on a fixed number of iterations. Though convergence can be 
slow at low SNRs, it can be fast at moderate to high SNRs. 
So rather than fixing a large number of iterations to stop the 
algorithm irrespective of the SNR, we use an efficient stopping 
criterion which makes use of the knowledge of the best ML 
cost found till the current iteration, as follows. Since the ML 
criterion is to minimize $\|\mathbf{H}\mathbf{x} - \mathbf{y}\|^2 = \phi(\mathbf{x}) + \mathbf{y}^H\mathbf{y} \geq 0$, the 
objective function $\phi(\mathbf{x})$ is never less than $-\mathbf{y}^H\mathbf{y}$. We 
stop the algorithm when the least ML cost achieved in an 
iteration is within a certain range of this lower bound on the 
global minimum. We stop the algorithm in the mth iteration only if 
$lflag = 1$ and the condition 

$$\frac{|\phi(\mathbf{g}^{(m)}) - (-\mathbf{y}^H\mathbf{y})|}{|-\mathbf{y}^H\mathbf{y}|} < \alpha_1 \tag{17}$$

is met with at least min_iter iterations being completed to 
make sure the search algorithm has 'settled.' The bound is 
gradually relaxed as the number of iterations increases and the 
algorithm is terminated when 

$$\frac{|\phi(\mathbf{g}^{(m)}) - (-\mathbf{y}^H\mathbf{y})|}{|-\mathbf{y}^H\mathbf{y}|} < m\alpha_2. \tag{18}$$

In (17) and (18), $\alpha_1$ and $\alpha_2$ are positive constants. In addition, 
we terminate the algorithm whenever the number of repetitions 
of solutions exceeds max_rep. Also, the maximum number of 
iterations is set to max_iter. 
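
One possible reading of the stopping rule is sketched below. How conditions (17) and (18) combine is not spelled out in the text, so this sketch assumes that either bound may trigger termination once `lflag` is set and `min_iter` iterations have passed; the function name `should_stop` and its argument layout are hypothetical. The constants $\alpha_1$ and $\alpha_2$ are passed as fractions (5% becomes 0.05).

```python
def should_stop(phi_best, yHy, m, lflag, alpha1, alpha2,
                min_iter, max_iter, n_rep, max_rep):
    """Return True if the search should terminate in iteration m."""
    if m >= max_iter or n_rep > max_rep:        # hard limits
        return True
    if lflag != 1 or m < min_iter:              # not settled yet
        return False
    gap = abs(phi_best - (-yHy)) / abs(-yHy)    # relative distance to -y^H y
    return gap < alpha1 or gap < m * alpha2     # (17) or relaxed bound (18)

common = dict(alpha1=0.05, alpha2=0.0005, min_iter=20, max_iter=300,
              n_rep=0, max_rep=75)
stop_close = should_stop(-9.8, 10.0, m=25, lflag=1, **common)
stop_early = should_stop(-9.8, 10.0, m=10, lflag=1, **common)
stop_far = should_stop(-9.0, 10.0, m=25, lflag=1, **common)
stop_relaxed = should_stop(-9.0, 10.0, m=250, lflag=1, **common)
```

The last two calls show the relaxation at work: a relative gap of 0.1 is rejected at iteration 25 but accepted at iteration 250, where $m\alpha_2 = 0.125$.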

B. RTS algorithm versus LAS algorithm 

It is noted that the likelihood ascent search (LAS) algorithm 
presented in [21]-[23] is also a local neighborhood search 
based algorithm, where the basic definition of neighborhood 
is the same as in RTS. However, LAS differs from RTS in the 
following aspects: i) while the definition of neighborhood is 
static in LAS for all iterations, in RTS, in addition to the basic 
neighborhood definition, there is also a dynamic aspect to the 
neighborhood definition by way of prohibiting certain vectors 
from being included in the neighbor list (implemented through 
repetition checks/tabu period), and ii) while LAS gets trapped 
in the local minimum that it first encounters and declares this 
minimum to be the final solution vector, RTS can potentially 
find better minima because of the escape strategy embedded 
in the algorithm (by way of allowing it to pick and move to the 
best neighbor even if that neighbor has a lesser likelihood than 
the current solution vector). 

It is further noted that a general version of LAS reported in 
[23], termed multistage LAS (MLAS), executes a different 
escape mechanism when it encounters a local minimum, by 
changing the neighborhood definition: it considers vectors 
which differ in two or more coordinates (as opposed to 
only one coordinate in the basic neighborhood definition) as 
neighbors. On escaping from a local minimum, the algorithm 
reverts to the basic neighborhood definition till the next 
local minimum is encountered and stops when no escape from 
a local minimum is possible. Since the performance gain of 
MLAS compared to LAS is found to be small, we limit our 
comparison of RTS to LAS only. Our simulation results for 
the systems considered in Sections III to V show that RTS 
performs better than LAS. 

III. RTS Performance in Large V-BLAST Systems 

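Consider a V-BLAST MIMO system with $N_t$ transmit and 
$N_r$ receive antennas. For this system, in the received signal 
model in (1), $\mathbf{x} \in \mathbb{A}^{N_t}$ is the transmitted symbol vector, where 
$\mathbb{A}$ is the modulation alphabet, $\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$ is the channel gain 
matrix whose entries are modeled as $\mathcal{CN}(0, 1)$, $\mathbf{y} \in \mathbb{C}^{N_r}$ is 
the received signal vector, and $\mathbf{n} \in \mathbb{C}^{N_r}$ is the noise vector 
whose entries are modeled as i.i.d. $\mathcal{CN}(0, \sigma^2 = \frac{N_t E_s}{\gamma})$, where 
$E_s$ is the average energy of the transmitted symbols and $\gamma$ is 
the average received SNR per receive antenna. We rewrite the 
complex system model in (1) as a real-valued system as 

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \tag{19}$$

where the real-valued quantities in (19) are formed from the 
complex quantities of (1), which appear on the right-hand sides below, as 

$$\mathbf{H} = \begin{bmatrix} \Re(\mathbf{H}) & -\Im(\mathbf{H}) \\ \Im(\mathbf{H}) & \Re(\mathbf{H}) \end{bmatrix}, \quad
\mathbf{y} = \begin{bmatrix} \Re(\mathbf{y}) \\ \Im(\mathbf{y}) \end{bmatrix}, \quad
\mathbf{x} = \begin{bmatrix} \Re(\mathbf{x}) \\ \Im(\mathbf{x}) \end{bmatrix}, \quad
\mathbf{n} = \begin{bmatrix} \Re(\mathbf{n}) \\ \Im(\mathbf{n}) \end{bmatrix}. \tag{20}$$

We apply the RTS algorithm on the real-valued system model 
in (19) and estimate the transmitted symbol vector. We note 
that the transmit and receive dimensions in the linear vector 
channel in (19) are $d_t = 2N_t$ and $d_r = 2N_r$. 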
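
The complex-to-real conversion in (19) and (20) can be checked with a short sketch. The function names `to_real` and `stack` and the numeric values are assumptions for illustration; the code stacks real parts over imaginary parts exactly as in (20) and confirms that the real-valued product reproduces the complex one.

```python
def to_real(H, y):
    """Build the real-valued H (2 N_r x 2 N_t) and y (length 2 N_r) of (20)."""
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    return top + bottom, stack(y)

def stack(v):
    """[Re(v); Im(v)] for a complex vector v."""
    return [c.real for c in v] + [c.imag for c in v]

H = [[1 + 2j, -0.5j], [0.3 - 1j, 2 + 0j]]
x = [1 - 1j, -1 + 1j]                                   # 4-QAM symbols
y = [sum(h * s for h, s in zip(row, x)) for row in H]   # noiseless y = Hx
H_r, y_r = to_real(H, y)
x_r = stack(x)
y_check = [sum(h * s for h, s in zip(row, x_r)) for row in H_r]
```

For 4-QAM each entry of `x_r` lies in a two-point real alphabet, so the RTS neighborhood on the real-valued model reduces to flipping one real coordinate at a time.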

In this section, we present the uncoded BER performance 
of RTS based detection of V-BLAST signals. Since the RTS 
algorithm is a heuristic, analytical evaluation of the BER and 
convergence behavior is difficult. So we evaluate the BER 
and convergence performance of the RTS algorithm through 
simulations. The following RTS parameters are used in the 
simulations for 4-QAM: MMSE initial vector, $P_0 = 2$, $\beta = 0.1$, 
$\alpha_1 = 5\%$, $\alpha_2 = 0.05\%$, max_rep = 75, min_iter = 20. Perfect 
channel state information at the receiver (CSIR) and i.i.d. 
fading are assumed. 

A. Convergence behavior of RTS in V-BLAST 

In Fig. 1, we plot the BER performance of the RTS 
algorithm as a function of maximum number of iterations, 
max_iter, in 8 x 8, 16 x 16, 32 x 32, and 64 x 64 V-BLAST 
systems with 4-QAM at an average SNR of 10 dB. Two main 
observations can be made from Fig. 1: i) for the system 
parameters considered, the BER converges (i.e., change in 
BER between successive iterations becomes very small) for 
max_iter greater than 300, and ii) the converged BER of 
RTS exhibits large-dimension behavior (i.e., converged BER 
improves with increasing $N_t = N_r$); e.g., the converged BER 
improves from $8.3 \times 10^{-3}$ for 8 x 8 V-BLAST to $1.3 \times 10^{-3}$ 
for 64 x 64 V-BLAST. This improvement is quite significant 
considering that the BER in SISO AWGN channel itself is 
$7.8 \times 10^{-4}$ for 4-QAM. We use max_iter to be 300 for 4-QAM 
in all the subsequent simulations in this section. 

Fig. 1. Uncoded BER performance of the RTS algorithm as a function of 
maximum number of iterations, max_iter, in 8 x 8, 16 x 16, 32 x 32, and 
64 x 64 V-BLAST with 4-QAM at SNR = 10 dB. 
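
The SISO AWGN reference value quoted above can be reproduced with the Gaussian Q-function. The sketch assumes that the 10 dB SNR is the symbol SNR $E_s/N_0$ and that 4-QAM is Gray mapped, so that the bit error rate is $Q(\sqrt{\mathrm{SNR}})$; these conventions are assumptions, since the text does not state them here.

```python
from math import erfc, sqrt

def q_func(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * erfc(x / sqrt(2))

snr_db = 10.0
snr = 10 ** (snr_db / 10)          # linear symbol SNR (assumed E_s / N_0)
ber_siso_awgn = q_func(sqrt(snr))  # 4-QAM bit error rate under the assumptions above
```

Under these assumptions the result is close to the $7.8 \times 10^{-4}$ figure in the text, which puts the converged 64 x 64 BER of $1.3 \times 10^{-3}$ within a factor of two of the single-antenna AWGN limit.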



B. RTS versus LAS performance in V-BLAST 

We next present the BER performance of the RTS algorithm 
in comparison with that of the LAS algorithm presented in 
[23]. Figure 2 shows the BER performance of RTS and LAS 
algorithms for 16 x 16, 32 x 32 and 64 x 64 V-BLAST with 4-QAM. 
It can be seen that for the number of dimensions (i.e., 
$N_t$) considered, RTS performs better than LAS; e.g., LAS 
requires 128 real dimensions (i.e., 64 x 64 V-BLAST with 4-QAM) 
to achieve performance within 1.8 dB of SISO 
AWGN performance at $10^{-3}$ BER, whereas RTS is able to 
achieve even better closeness to SISO AWGN performance 
with just 32 real dimensions (i.e., 16 x 16 V-BLAST with 4-QAM). 
Also, in 64 x 64 V-BLAST, RTS achieves $10^{-3}$ BER at 
an SNR of just 0.4 dB away from SISO AWGN performance. 
We note that RTS is able to achieve this better performance 
because, while the bit/symbol-flipping strategies are similar 
in both RTS and LAS, the inherent escape strategy in RTS 
allows it to move out of local minima and move towards better 
solutions. Consequently, RTS incurs some extra complexity 
compared to LAS as detailed in the following subsection. 

C. Complexity of RTS in V-BLAST 

Here, we present the complexity of the RTS algorithm 
for detection in V-BLAST. The total complexity comprises 
three main components, namely, i) computation of the 
initial solution vector $\mathbf{x}^{(0)}$, ii) computation of $\mathbf{H}^T\mathbf{H}$, and 
iii) the reactive tabu search operation. The MMSE initial 
solution vector can be computed in $O(N_t^2 N_r)$ complexity, 
i.e., in $O(N_t N_r)$ per-symbol complexity since there are $N_t$ 
symbols per channel use. Likewise, the computation of $\mathbf{H}^T\mathbf{H}$ 
can be done in $O(N_t N_r)$ per-symbol complexity. We note 
that, since computation of $\mathbf{x}^{(0)}$ and $\mathbf{H}^T\mathbf{H}$ are needed in both 
RTS and LAS, the complexity components i) and ii) will be 
same for both these algorithms. We further note that, while 
the complexity components i) and ii) are deterministic, the 
component iii), which is due to the search part alone, is random, 
and so we obtained the average complexity of component 
iii) through simulations. Figure 3 shows the complexity plots 
for the search part alone (i.e., component iii)) as well as the 
overall complexity plots of the RTS and LAS algorithms for 
V-BLAST with $N_t = N_r$ and 4-QAM at a BER of $10^{-2}$. 
From Fig. 3, it can be observed that the RTS search part has a 
higher complexity than the LAS search part. This is expected, 
because RTS can escape from a local minimum and 
look for better solutions, whereas LAS settles in the first local 
minimum itself. However, it can be seen that since the overall 
complexity is dominated by the computation of $\mathbf{H}^T\mathbf{H}$ and 
$\mathbf{x}^{(0)}$, the difference in overall complexity between RTS and 
LAS is not high. 

Fig. 2. Uncoded BER performance of RTS detection of 16 x 16, 32 x 32 
and 64 x 64 V-BLAST signals with 4-QAM. 

D. Comparison with variants of sphere decoders in V-BLAST 

In Fig. 4, we present an uncoded BER comparison of the 
RTS detector with the fixed-complexity sphere decoder (FSD) 
presented in [12] for V-BLAST with $N_t = N_r = 4, 8, 16, 32$ 
and 4-QAM. The performance of the reduced-complexity 
sphere decoder (RSD) presented in [13] is also plotted for 
$N_t = N_r = 4, 8, 16$. We did not evaluate the performance of 
RSD for $N_t = N_r = 32$ due to its high complexity. Comparing 
the performances of FSD, RSD and RTS in Fig. 4, we observe 
the following: 

1) Since the complexity of FSD is forced to be constant, 
the performance of FSD is compromised at low/medium 
SNRs compared to that of RSD (e.g., see plots for $N_t = 
N_r = 16$, where RSD performs better than FSD by about 
1 dB at $10^{-2}$ BER). 

2) Performance of RTS is very close to that of RSD (see 
plots of RSD and RTS for $N_t = N_r = 16$). RTS 
achieves such good performance in large dimensions at a 
significantly lower complexity compared to that of RSD 
(see complexity comparison in Table I for 16 x 16 V-BLAST). 

3) For large number of antennas (e.g., $N_t = N_r = 32$), 
RSD complexity becomes prohibitively high, and so we 
do not show its performance for 32 x 32 V-BLAST. However, 
we have shown the FSD and RTS performances 
for 32 x 32 V-BLAST. It is seen that RTS performs 
significantly better than FSD (by about 1.5 dB at $10^{-2}$ 
BER); this is due to the sub-optimum nature of FSD 
that arises because of fixing its complexity, and due 
to the large-dimension behavior advantage of RTS. In 
addition, RTS achieves this better performance than FSD 
at a significantly lower complexity compared to that of 
FSD (see details in the complexity comparison text in 
the following paragraphs and the 32 x 32 system entries 
in Table I). 

Fig. 3. Complexity comparison of RTS and LAS algorithms in detection 
of V-BLAST signals with 4-QAM at $10^{-2}$ BER. 

Complexity comparison between RTS and FSD in V-BLAST: 
The FSD algorithm in [12] has two parts: an ordering 
part (similar to that in the V-BLAST algorithm) and a search 
part. The complexity of the search part, which is random 
in conventional SD, is made constant in FSD by fixing the 
number of search candidates irrespective of the SNR. The 
ordering part has $O(N_t^3)$ complexity in $N_t$. Also, the algorithm 
has $O(M^{\lceil \sqrt{N_t} - 1 \rceil})$ complexity in $M$ (i.e., alphabet size) for 
$N_t = N_r$ [12]. On the other hand, while RTS also has $O(N_t^3)$ 
complexity in $N_t$ in a V-BLAST system, its complexity in $M$ 
is just $O(M N_t)$ since at most $(M - 1)N_t$ neighbors need to 
be considered. The exponential complexity of FSD in $\sqrt{N_t}$ 
makes it increasingly prohibitive for increasing $N_t$. For example, 
for $N_t = N_r = 32$ and 16-QAM, the complexity of FSD, 
which is dominated by $O(M^{\lceil \sqrt{N_t} - 1 \rceil})$, is $O(16^5) = O(2^{20})$. 
For the same system settings, the RTS complexity is dominated 
by $O(N_t^3)$, which is $O(32^3) = O(2^{15})$. The differential in 
complexity between RTS and FSD (in favor of RTS) widens 
further if 64-QAM is considered. 

Fig. 4. Comparison of uncoded BER performance of V-BLAST using 
RTS detection versus fixed-complexity sphere decoding in [12] and reduced 
complexity sphere decoder in [13] for $N_t = N_r = 4, 8, 16, 32$ and 4-QAM. 
RSD performance for 32 x 32 V-BLAST is not shown due to its 
high complexity. 
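
The orders of magnitude quoted in this comparison can be reproduced with a few lines of arithmetic. The sketch assumes that the dominant FSD search term is $M^{\lceil \sqrt{N_t} - 1 \rceil}$, a reading of the garbled exponent that matches the $16^5$ figure in the text, and that the dominant RTS term is $N_t^3$; the function name `fsd_exponent` is hypothetical.

```python
from math import ceil, sqrt

def fsd_exponent(n_t):
    """Assumed exponent of M in the FSD search term: ceil(sqrt(N_t) - 1)."""
    return ceil(sqrt(n_t) - 1)

n_t = 32
fsd_16qam = 16 ** fsd_exponent(n_t)   # 16^5
fsd_64qam = 64 ** fsd_exponent(n_t)   # 64^5
rts_term = n_t ** 3                   # 32^3
ratio_16qam = fsd_16qam // rts_term   # gap between the two dominant terms
```

These are order-of-growth terms rather than operation counts, so the ratio only indicates how quickly the gap opens as $M$ grows.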

A complexity comparison along with performance comparison 
between different detectors is shown in Table I, where 
we have presented the per-symbol complexity (measured in 
number of real operations) and the SNR required to achieve 
an uncoded BER of $10^{-2}$ in 4 x 4, 8 x 8, 16 x 16 and 32 x 32 V-BLAST 
systems with 4-QAM. From Table I, we see that the 
complexity of FSD for 32 x 32 V-BLAST is about an order of 
magnitude higher compared to that of RTS, due to the $O(M^{\lceil \sqrt{N_t} - 1 \rceil})$ 
complexity of FSD. Also, even with this higher complexity, 
FSD achieves poorer performance than RTS (i.e., FSD needs 
about 1.5 dB more SNR than required by RTS to achieve $10^{-2}$ 
BER), as described earlier. 



E. Higher-Order QAM Performance in V-BLAST 

In Fig. 5, we illustrate the performance of RTS for 
higher-order QAM in a 32 x 32 V-BLAST system (16-QAM 
and 64-QAM at spectral efficiencies of 128 bps/Hz 
and 192 bps/Hz). We do not give the performance of 
FSD and RSD due to their high complexities for the 
considered values of $N_t$ and $M$. As we mentioned earlier, 
FSD complexity for $N_t = N_r = 32$ and $M = 64$ would 
be $O(64^{\lceil \sqrt{32} - 1 \rceil}) = O(2^{30})$, which is prohibitive. The 
complexities of RTS and LAS, on the other hand, scale well 
for such large dimensions, allowing us to show their simulated 
BER performance in Fig. 5. The following RTS parameters are 
used in the simulations: MMSE initial vector, $P_0 = 2$, $\beta = 0.01$; 
($N = 3$, $\alpha_1 = 0.3\%$, $\alpha_2 = 0.001\%$, max_rep = 250, min_iter = 
30, max_iter = 1000) for 16-QAM, and ($N = 2$, $\alpha_1 = 0.005\%$, 
$\alpha_2 = 0.00005\%$, max_rep = 1000, min_iter = 50, max_iter 

3 for 64-QAM. The plots in Fig. 5 show that RTS 
performs better than LAS by about 6 dB at $10^{-2}$ BER for 
16-QAM and 64-QAM. 
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10.8 dB 
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9 dB 









TABLE I 

COMPLEXITY AND PERFORMANCE COMPARISON OF THE RTS ALGORITHM WITH THE FSD ALGORITHM IN [12] AND THE RSD ALGORITHM IN [13] IN
4 × 4, 8 × 8, 16 × 16 AND 32 × 32 V-BLAST WITH 4-QAM. RTS OUTPERFORMS FSD IN TERMS OF COMPLEXITY AND PERFORMANCE FOR LARGE
DIMENSIONS (E.G., 32 × 32). FOR LARGE $M$ AND LARGE $N_t$, COMPLEXITY OF FSD GETS PROHIBITIVELY HIGH.



Fig. 5. Uncoded BER performance of RTS and LAS algorithms for 32 × 32
V-BLAST system with 16-QAM and 64-QAM. FSD and RSD performances
are not shown due to their high complexities.



IV. RTS Performance in Large Non-Orthogonal STBCs

Large-MIMO systems that employ non-orthogonal STBCs 
from CDA 03,021 are attractive because these STBCs can 
simultaneously provide both full rate (i.e., $N_t$ complex
symbols per channel use, which is the same as in V-BLAST) as
well as full transmit diversity (V-BLAST does not provide
transmit diversity). The 2 × 2 Golden code is a well-known
non-orthogonal STBC from CDA for 2 transmit antennas [16].
A non-orthogonal STBC from CDA is an $N_t \times N_t$ matrix whose
entries are formed using linear combinations of various data
symbols [14]. Each STBC matrix is constructed using $N_t^2$ data
symbols, which are sent using $N_t$ transmit antennas in $N_t$
channel uses. The received signal matrix can be vectorized
and written in an equivalent real system model of the form
dl9t , where the number of transmit and receive dimensions 
are $d_t = 2N_t^2$ and $d_r = 2N_t N_r$, respectively, for QAM.



High spectral efficiencies can be achieved using large
non-orthogonal STBCs from CDA. For example, a 16 × 16 STBC
from CDA has 256 complex symbols in it with 512 real
dimensions; with 16-QAM and rate-3/4 turbo code, this system
offers a high spectral efficiency of 48 bps/Hz. Variants of
sphere decoding (e.g., FSD [12] and RSD [13]) do not
scale well to decode signals with hundreds of dimensions².
In [23], we have shown that the LAS algorithm can scale
well to such hundreds of dimensions while achieving good
performance. In this section, we show that RTS also scales
well in complexity in decoding large non-orthogonal STBCs
from CDA having hundreds of dimensions, while achieving
even better performance than LAS.
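
The dimension and rate figures quoted above follow from simple counting.
The following is a minimal sketch of that arithmetic; the function names
are illustrative (not from the paper), and the rate formula assumes a
full-rate scheme carrying $N_t$ complex symbols per channel use.

```python
import math


def stbc_dimensions(n_t, n_r):
    """Return (d_t, d_r, symbols) for an n_t x n_t non-orthogonal STBC.

    d_t = 2*N_t^2 and d_r = 2*N_t*N_r are the real transmit and receive
    dimensions; N_t^2 complex symbols are carried per STBC matrix.
    """
    return 2 * n_t ** 2, 2 * n_t * n_r, n_t ** 2


def spectral_efficiency(n_t, qam_order, code_rate=1.0):
    """bps/Hz of a full-rate scheme: N_t symbols per channel use,
    log2(M) bits per symbol, scaled by the outer code rate."""
    return n_t * math.log2(qam_order) * code_rate


if __name__ == "__main__":
    print(stbc_dimensions(16, 16))            # (512, 512, 256)
    print(spectral_efficiency(16, 16, 0.75))  # 48.0
    print(spectral_efficiency(12, 4, 0.75))   # 18.0
```

The same formula with code rate 1 gives the 128 bps/Hz and 192 bps/Hz
figures quoted for uncoded 32 × 32 V-BLAST with 16-QAM and 64-QAM.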

RTS complexity in decoding non-orthogonal STBCs from
CDA: Here again, the complexity of computing $H^T H$ dominates
the search complexity. Note that there are $2N_t^2$ transmit and
$2N_t N_r$ receive dimensions, and $N_t^2$ symbols per STBC.
Exploiting the permutation nature of the weight matrices of
the non-orthogonal STBCs from CDA [23], the per-symbol
complexity of computing $H^T H$, and hence the overall
per-symbol complexity of RTS decoding of non-orthogonal STBCs
from CDA, is $O(N_t^2 N_r)$.

In the following subsections, we present the BER
performance of RTS in decoding non-orthogonal STBCs. The
following parameters are used in the simulations for 4-QAM:
MMSE initial vector, $P_0 = 2$, $\beta = 1$, $\alpha_1 = 5\%$, $\alpha_2 = 0.05\%$,
max_rep = 75, min_iter = 20, max_iter = 300.

A. RTS versus LAS performance in decoding non-orthogonal STBCs

In Fig. 6, we plot the uncoded BER of the RTS algorithm
as a function of average received SNR in decoding 4 × 4
(32 dimensions), 8 × 8 (128 dimensions) and 12 × 12 (288
dimensions) non-orthogonal STBCs from CDA for 4-QAM
and $N_t = N_r$. Perfect CSIR and i.i.d. fading are assumed.
For the same settings, performance of the LAS algorithm is
2 Since FSD and RSD complexities are prohibitive to decode signals with 
hundreds of dimensions, we do not present the performance of FSD and RSD 
for large non-orthogonal STBCs. 
Fig. 6. Uncoded BER of RTS decoding of 4 × 4, 8 × 8, and 12 × 12
non-orthogonal STBCs from CDA for $N_t = N_r$ and 4-QAM.

also plotted for comparison. MMSE initial vector is used in
both RTS and LAS. As a lower bound on performance, we
have plotted the BER performance on a SISO AWGN channel
as well. From Fig. 6, it can be observed that the BER of
RTS improves and approaches SISO AWGN performance as
$N_t = N_r$ (i.e., STBC size) is increased; e.g., with a 12 × 12
STBC having 288 dimensions, RTS decoding comes to within
0.4 dB of SISO AWGN performance at $10^{-3}$ uncoded BER.
Also, as in the case of V-BLAST, RTS is found to perform
better than LAS in decoding non-orthogonal STBCs. In the
case of 16-QAM also, RTS performs better than LAS, as can
be seen in Fig. ??, where the following parameters are used
in the simulations: MMSE initial vector, $P_0 = 2$, $\beta = 1$,
$N = 3$, $\alpha_1 = 0.1\%$, $\alpha_2 = 0.002\%$, max_rep = 75,
min_iter = 30, max_iter = 800.

B. Turbo coded BER performance of RTS 

Fig. 7 shows the rate-3/4 turbo coded BER performance
of RTS decoding of a 12 × 12 non-orthogonal STBC from CDA
with $N_t = N_r$ and 4-QAM (corresponding to a spectral
efficiency of 18 bps/Hz), under perfect CSIR and i.i.d. fading.
The theoretical minimum SNR required to achieve 18 bps/Hz
spectral efficiency on an $N_t = N_r = 12$ MIMO channel with
perfect CSIR and i.i.d. fading is 4.27 dB (obtained through
simulation of the ergodic MIMO capacity formula [24]). From
Fig. 7, it is seen that RTS decoding achieves a vertical fall
in coded BER within about 5 dB of the theoretical minimum
SNR, which is good nearness-to-capacity performance. This
nearness to capacity can be further improved by 1 to 1.5 dB
if soft decision values, proposed in [23], are fed to the
turbo decoder. Also, the performance of RTS is about 1 dB
better than that of LAS at $2 \times 10^{-4}$ coded BER for the
same system settings.
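
The 4.27 dB figure can be sanity-checked numerically. The sketch below
assumes the ergodic capacity takes the standard form
$C = E[\log_2 \det(I_{N_r} + (\gamma/N_t) H H^H)]$ with i.i.d. $\mathcal{CN}(0,1)$
channel entries and average received SNR $\gamma$ per receive antenna; the
paper only cites [24] for the formula, so this form, the function names,
and the trial count are assumptions made for illustration.

```python
import math
import random


def log2_det_hpd(a):
    """log2(det(A)) of a Hermitian positive-definite matrix (list of lists),
    via Gaussian elimination without pivoting on a copy of A."""
    n = len(a)
    a = [row[:] for row in a]
    total = 0.0
    for k in range(n):
        pivot = a[k][k]
        total += math.log2(abs(pivot))
        for r in range(k + 1, n):
            f = a[r][k] / pivot
            for c in range(k, n):
                a[r][c] -= f * a[k][c]
    return total


def ergodic_capacity(n_t, n_r, snr_db, trials=200, seed=1):
    """Monte Carlo estimate (bps/Hz) of E[log2 det(I + (snr/N_t) H H^H)]."""
    rng = random.Random(seed)
    snr = 10.0 ** (snr_db / 10.0)
    sigma = math.sqrt(0.5)  # std of real and imaginary parts, so E|h|^2 = 1
    acc = 0.0
    for _ in range(trials):
        h = [[complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
              for _ in range(n_t)] for _ in range(n_r)]
        # A = I + (snr / N_t) * H H^H, an N_r x N_r Hermitian matrix
        a = [[(1.0 if r == c else 0.0)
              + (snr / n_t) * sum(h[r][k] * h[c][k].conjugate()
                                  for k in range(n_t))
              for c in range(n_r)] for r in range(n_r)]
        acc += log2_det_hpd(a)
    return acc / trials


if __name__ == "__main__":
    print(ergodic_capacity(12, 12, 4.27))
```

Under these assumptions, the estimate for a 12 × 12 channel at 4.27 dB
should land near 18 bps/Hz, which is the operating point against which
the 5 dB gap above is measured.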

C. Iterative RTS Decoding/Channel Estimation 

Next, we relax the perfect CSIR assumption by considering 
a training based iterative RTS decoding/channel estimation 



Fig. 7. Turbo coded BER of RTS decoding of 12 × 12 non-orthogonal STBC
from CDA with $N_t = N_r$, 4-QAM, rate-3/4 turbo code, and 18 bps/Hz with
perfect CSIR and estimated CSIR.

scheme. Transmission is carried out in frames, where one
$N_t \times N_t$ pilot matrix (for training purposes) followed by
$N_d$ data STBC matrices are sent in each frame [23]. One
frame length, $T$ (taken to be the channel coherence time),
is $T = (N_d + 1)N_t$ channel uses. The proposed scheme works
as follows: i) obtain an MMSE estimate of the channel matrix
during the pilot phase, ii) use the estimated channel matrix
to decode the data STBC matrices using RTS, iii) use the
decoded STBCs to estimate the channel matrix again, and iv)
iterate between channel estimation and RTS decoding for a
certain number of times. For the 12 × 12 STBC from CDA, in
addition to perfect CSIR performance, Fig. 7 also shows the
performance with CSIR estimated using the above iterative
RTS decoding/channel estimation scheme for $N_d = 8$ and
$N_d = 20$. Two iterations between RTS decoding and channel
estimation are used. With $N_d = 20$ (which corresponds to
large coherence times, i.e., slow fading), the BER and bps/Hz
with estimated CSIR get closer to those with perfect CSIR.
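
To make the pilot overhead concrete, the sketch below computes the frame
length $T = (N_d + 1)N_t$ and the resulting net spectral efficiency. It
assumes that the only rate loss under estimated CSIR is the fraction of
the frame spent on the pilot matrix; the function name and this
accounting are illustrative assumptions, not taken from the paper.

```python
def frame_overhead(n_t, n_d, full_rate_bps_hz):
    """Pilot overhead for one N_t x N_t pilot matrix followed by N_d data STBCs.

    Returns (frame length T in channel uses, fraction of T carrying data,
    net spectral efficiency in bps/Hz).
    """
    frame_len = (n_d + 1) * n_t          # T = (N_d + 1) * N_t
    data_fraction = n_d / (n_d + 1)      # N_d of the N_d + 1 matrices carry data
    return frame_len, data_fraction, full_rate_bps_hz * data_fraction


if __name__ == "__main__":
    for n_d in (8, 20):
        print(n_d, frame_overhead(12, n_d, 18.0))
```

For the 12 × 12 STBC at 18 bps/Hz, this gives $T = 108$ channel uses and
about 16 bps/Hz for $N_d = 8$, and $T = 252$ channel uses and about
17.1 bps/Hz for $N_d = 20$, which illustrates why longer coherence times
bring the rate closer to the perfect-CSIR value.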

D. Effect of MIMO Spatial Correlation 

In all the previous performance and complexity plots, we
assumed i.i.d. fading. But spatial correlation at transmit/receive
antennas and the structure of the scattering and propagation
environment can affect the rank structure of the MIMO channel,
resulting in degraded performance [25], [26]. We relaxed the
i.i.d. fading assumption by considering the correlated MIMO
channel model proposed by Gesbert et al. in [26], which
takes into account carrier frequency ($f_c$), spacing between
antenna elements ($l_t$, $l_r$), distance between transmit and receive
antennas ($D$), and scattering environment. In Fig. 8, we plot
the uncoded BER of RTS decoding of 12 × 12 STBC from
CDA with perfect CSIR in i) i.i.d. fading, and ii) the correlated
MIMO fading model in [26]. It is seen that, compared to
i.i.d. fading, there is a performance loss under spatial correlation
for $N_t = N_r = 12$; further, use of more receive antennas
($N_r = 14$, $N_t = 12$) alleviates this loss in performance.
Fig. 8. Effect of spatial correlation on the performance of RTS decoding
of 12 × 12 STBC from CDA with $N_t = 12$, $N_r = 12, 14$, 4-QAM, rate-3/4
turbo code, 18 bps/Hz. $f_c = 5$ GHz, $D = 500$ m, $S = 30$, $D_t = D_r = 20$
m, $\theta_t = \theta_r = 90^\circ$, $N_r l_r = N_t l_t = 72$ cm.



V. RTS Equalizer for MIMO-ISI Channels 

In this section, we consider the adoption and performance 
of the RTS algorithm in another communication scheme, 
where large dimensions are created in time due to the highly 
frequency selective nature of the channel, i.e., large number 
(tens to hundreds) of multipath components (MPC), as can 
typically happen in UWB channels [17], [20].

Consider a frequency-selective MIMO channel with $N_t$
transmit and $N_r$ receive antennas (Fig. 9). Let $L$ denote the
number of MPCs. Data is transmitted in frames, where each
frame has $K$ data symbols preceded by a cyclic prefix (CP)
of length $L$ symbols, $K > L$. While CP avoids inter-frame
interference, there will be ISI within the frame. Let
$x_q \in \mathbb{A}^{N_t}$ be the transmitted symbol vector at time $q$,
$0 \le q \le K-1$, where $\mathbb{A}$ is the transmit symbol alphabet,
which is taken to be $M$-QAM. The received signal vector at
time $q$ can be written as

$$y_q = \sum_{l=0}^{L-1} H_l \, x_{q-l} + w_q, \quad q = 0, \cdots, K-1, \qquad (21)$$

where $y_q \in \mathbb{C}^{N_r \times 1}$, and $H_l \in \mathbb{C}^{N_r \times N_t}$ is the channel gain matrix
for the $l$th MPC. The entries of $H_l$ are assumed to be random
with distribution $\mathcal{CN}(0, 1)$. It is further assumed that $H_l$,
$l = 0, \cdots, L-1$, do not change for one frame duration.
$w_q \in \mathbb{C}^{N_r \times 1}$ is the additive white Gaussian noise vector at time $q$,
whose entries are independent, each with variance $N_0$. The
CP renders the linearly convolving channel a circularly
convolving one, and so the channel will be multiplicative in
the frequency domain. Because of the CP, the received signal in
the frequency domain, for the $i$th frequency index ($0 \le i \le K-1$),
can be written as



$$r_i = G_i u_i + v_i, \qquad (22)$$

where $r_i = \frac{1}{\sqrt{K}} \sum_{q=0}^{K-1} \rho_{q,i} \, y_q$,
Fig. 9. MIMO-ISI channel model.
where $\rho_{q,i} = e^{-\mathrm{j} 2\pi q i / K}$ is the $(q,i)$th entry of the $K$-point DFT
matrix (up to normalization) and $\otimes$ denotes the Kronecker product.
The received signal model in (23) can be rewritten in real form with
$d_t = 2N_t K$ and $d_r = 2N_r K$. The RTS algorithm is applied to this
real-valued system model.
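
The claim that the CP makes the channel multiplicative in the frequency
domain can be checked numerically. The sketch below is noise-free and
assumes a unitary DFT with kernel $e^{-\mathrm{j}2\pi qi/K}$ and per-bin gain
$G_i = \sum_{l=0}^{L-1} H_l \, e^{-\mathrm{j}2\pi li/K}$; these conventions and
all function names are assumptions made for illustration. The cyclic
prefix is modelled by indexing $x_{q-l}$ modulo $K$.

```python
import cmath
import random


def dft_vectors(vectors, sign=-1):
    """Unitary K-point DFT across time of a list of K equal-length vectors.
    Use sign=+1 for the inverse transform."""
    K, n = len(vectors), len(vectors[0])
    out = []
    for i in range(K):
        acc = [0j] * n
        for q in range(K):
            w = cmath.exp(sign * 2j * cmath.pi * q * i / K)
            for a in range(n):
                acc[a] += w * vectors[q][a]
        out.append([v / K ** 0.5 for v in acc])
    return out


def matvec(m, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(row[c] * v[c] for c in range(len(v))) for row in m]


def circular_channel(h_taps, x):
    """Noise-free y_q = sum_l H_l x_{(q-l) mod K}; the modulo models the CP."""
    K, n_r = len(x), len(h_taps[0])
    y = []
    for q in range(K):
        acc = [0j] * n_r
        for l, h in enumerate(h_taps):
            acc = [a + t for a, t in zip(acc, matvec(h, x[(q - l) % K]))]
        y.append(acc)
    return y


def freq_gain(h_taps, i, K):
    """G_i = sum_l H_l exp(-j*2*pi*l*i/K)."""
    n_r, n_t = len(h_taps[0]), len(h_taps[0][0])
    g = [[0j] * n_t for _ in range(n_r)]
    for l, h in enumerate(h_taps):
        w = cmath.exp(-2j * cmath.pi * l * i / K)
        for r in range(n_r):
            for c in range(n_t):
                g[r][c] += w * h[r][c]
    return g


def max_bin_error(n_t=2, n_r=2, L=3, K=8, seed=0):
    """Largest |r_i - G_i u_i| over all bins for one random channel and frame."""
    rng = random.Random(seed)

    def cn():  # one CN(0, 1) sample
        return complex(rng.gauss(0, 0.5 ** 0.5), rng.gauss(0, 0.5 ** 0.5))

    h_taps = [[[cn() for _ in range(n_t)] for _ in range(n_r)] for _ in range(L)]
    qam4 = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
    x = [[rng.choice(qam4) for _ in range(n_t)] for _ in range(K)]
    r = dft_vectors(circular_channel(h_taps, x))
    u = dft_vectors(x)
    err = 0.0
    for i in range(K):
        pred = matvec(freq_gain(h_taps, i, K), u[i])
        err = max(err, max(abs(a - b) for a, b in zip(r[i], pred)))
    return err


if __name__ == "__main__":
    print(max_bin_error())  # should be at floating-point rounding level
```

Each frequency bin therefore sees an $N_r \times N_t$ flat MIMO channel
$G_i$, which is what allows per-bin processing such as MMSE nulling.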

Initial vector using FD-MMSE equalizer: The detected 
symbol vector obtained using frequency domain (FD) MMSE 
equalization can be used as the initial vector to the RTS 
algorithm. The FD-MMSE equalizer on the ith frequency 
employs MMSE nulling as 

$$\hat{u}_i = \left(G_i^H G_i + \frac{N_0}{E_s} I_{N_t}\right)^{-1} G_i^H r_i, \quad 0 \le i \le K-1, \qquad (24)$$

where $E_s$ is the average energy of a transmitted symbol. The
$\hat{u}_i$'s are transformed back to the time domain using a $K$-point IDFT
to obtain an estimate of the transmitted symbol vector as

$$\hat{x}_q = \frac{1}{\sqrt{K}} \sum_{i=0}^{K-1} e^{\mathrm{j} 2\pi q i / K} \, \hat{u}_i, \quad 0 \le q \le K-1, \qquad (25)$$

which are used to form the initial vector to the RTS algorithm.
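
As an illustration of the per-bin MMSE nulling step, the sketch below
solves $(G_i^H G_i + (N_0/E_s) I_{N_t}) \hat{u}_i = G_i^H r_i$ for a single
frequency bin using plain Gaussian elimination. The helper names are
hypothetical, and a practical implementation would more likely use a
numerical linear algebra library.

```python
def solve(a, b):
    """Solve A z = b for a complex matrix A (list of lists) by Gaussian
    elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[p] = m[p], m[k]
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= f * m[k][c]
    z = [0j] * n
    for k in reversed(range(n)):
        s = sum(m[k][c] * z[c] for c in range(k + 1, n))
        z[k] = (m[k][n] - s) / m[k][k]
    return z


def fd_mmse_bin(g, r, n0_over_es):
    """u_hat = (G^H G + (N0/Es) I)^(-1) G^H r for one frequency bin."""
    n_r, n_t = len(g), len(g[0])
    gh = [[g[j][i].conjugate() for j in range(n_r)] for i in range(n_t)]
    a = [[sum(gh[i][k] * g[k][j] for k in range(n_r))
          + (n0_over_es if i == j else 0.0)
          for j in range(n_t)] for i in range(n_t)]
    rhs = [sum(gh[i][k] * r[k] for k in range(n_r)) for i in range(n_t)]
    return solve(a, rhs)


if __name__ == "__main__":
    g = [[1 + 1j, 0.5], [0.2j, 2 - 1j]]
    u = [1 - 1j, -1 + 1j]
    r = [sum(g[i][k] * u[k] for k in range(2)) for i in range(2)]
    print(fd_mmse_bin(g, r, 1e-12))  # close to u when noise is negligible
```

With negligible $N_0/E_s$ the estimate approaches the zero-forcing
solution; larger values shrink the estimate toward zero, trading
interference suppression for reduced noise enhancement.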

A. Performance Results and Discussions 

We evaluated the BER performance of the proposed RTS
equalizer in a 4 × 4 MIMO V-BLAST system with 4-QAM
as a function of average $E_b/N_0$ per receive antenna, through
simulations. We have assumed a uniform power delay profile
(i.e., all the $L$ paths are assumed to be of equal energy).
We evaluated the performance for various numbers of delay
paths, $L$, and frame sizes, $K$, keeping $L/K$ constant. It is
noted that the system becomes a 'large-dimension' system
when $L$ and $K$ are increased keeping $L/K$ fixed. The
FD-MMSE equalizer output is used as the initial vector for
both RTS and LAS. The following RTS parameters are used:
$P_0 = 2$; $\beta = 1$; $\alpha_1 = 0.03$; max_rep = 75; min_iter = 30.
For $K = 64$ and 128, max_iter = 300 and $\alpha_2 = 0.00075$.
For $K = 512$, max_iter = 500 and $\alpha_2 = 0.0004$.

In Fig. 10, we plot the uncoded BER of the RTS equalizer
for ($L = 6$, $K = 64$), ($L = 12$, $K = 128$), and ($L = 48$,
$K = 512$), i.e., $L/K = 0.09375$. Note that for ($L = 48$, $K = 512$),
the number of transmit dimensions is $d_t = 2N_t K = 2 \times 4 \times
512 = 4096$. Since FSD and RSD complexities are prohibitive
when the number of dimensions is in the thousands, we do
not give their performances. In addition to the performance
of RTS, we have given the performance of i) the FD-MMSE
equalizer (without any subsequent search), ii) the LAS equalizer,
and iii) single-input multiple-output (SIMO) AWGN with
$N_r = 4$ (which can be viewed as a good lower bound on
the best detector performance). It is seen that the performance
of the FD-MMSE equalizer is poor. However, the subsequent
search operations carried out in RTS and LAS result in
significantly improved performance for increasing $L$, $K$. Both
RTS and LAS show large-dimension behavior in this system
also (i.e., BER improves for increasing $L$, $K$, keeping $L/K$
fixed). For a given $L$, RTS performs better than LAS. For
example, at $10^{-3}$ BER, RTS performs better by about 1.5 dB
and 0.8 dB compared to LAS for ($L = 6$, $K = 64$) and
($L = 12$, $K = 128$), respectively. We note that the per-symbol
complexity of FD-MMSE (i.e., initial vector) computation is
$O(KN_t + N_t^2)$. The per-symbol complexity of $H^T H$
computation is $O(K^2 N_t)$. The per-symbol search complexity
of RTS, obtained by simulations, is $O(KN_t)$. So the overall
per-symbol complexity of the RTS equalizer is
$O(K^2 N_t) + O(KN_t + N_t^2)$.

VI. Conclusions 

We conclude by highlighting some recent trends in high
spectral efficiency MIMO systems/measurements with large
numbers of antennas to bring out the contextual importance
and relevance of the work presented in this paper. 1) NTT
DoCoMo has already field demonstrated a 12 × 12 V-BLAST
system operating at 5 Gbps data rate and 50 bps/Hz spectral
efficiency in the 4.6 GHz band at a mobile speed of 10 km/h [29].



Fig. 10. Comparison of the BER performance of the proposed RTS equalizer
with those of LAS equalizer and FD-MMSE equalizer in a 4 × 4 V-BLAST
system with 4-QAM for different number of MPCs ($L$) and frame sizes ($K$),
keeping $L/K$ constant. Uniform power delay profile. (Plot: BER versus
$E_b/N_0$ in dB.)

2) Evolution of WiFi standards (evolution from IEEE 802.11n
to IEEE 802.11ac to achieve multi-gigabit rate transmissions
in the 5 GHz band) now considers 16 × 16 MIMO operation; e.g.,
see 16 × 16 MIMO indoor channel sounding measurements at
5.17 GHz reported in [30] for consideration in WiFi standards.

3) 64 × 64 MIMO channel sounding measurements at 5
GHz in indoor environments have been reported in [31]. We
note that, while the RF/antenna technologies/measurements
for large-MIMO systems are maturing, there is a lack
of current focus on the development of low-complexity baseband
algorithms for detection and channel estimation for
large-MIMO systems (MIMO systems with 16 or more antennas)
to reap their high spectral efficiency benefits. A vast body
of MIMO detection literature is heavily focused on 4 × 4
(in some cases 8 × 8) MIMO. Algorithms suited for
large-MIMO signal detection and their performance have started
appearing in the literature recently (e.g., [22], [23]). Here,
we showed that the RTS algorithm presented in this paper
achieves even better performance than the LAS algorithm
presented in [23] (e.g., 6 dB better performance in 32 × 32
V-BLAST with 16- and 64-QAM in Fig. 5). We also showed
that the considered sphere decoding variants (FSD, RSD)
performed poorly and/or did not scale well for
large-dimension detection (e.g., see 32 × 32 V-BLAST plots and
complexities in Fig. 4 and Table 1). The large-dimension
behavior of the RTS algorithm has other potential applications,
like the low-complexity equalization in severely delay-spread
UWB systems (with thousands of dimensions) presented in
this paper. Finally, we note that the development of algorithms
for low-complexity, high-performance large-dimension signal
processing for communication applications is a promising
research direction.